Metasurface-driven full-space structured light for three-dimensional imaging

Structured light (SL)-based depth-sensing technology illuminates objects with an array of dots, and the backscattered light is monitored to extract three-dimensional information. Conventionally, diffractive optical elements have been used to form the laser dot array; however, their field-of-view (FOV) and diffraction efficiency are limited by their micron-scale pixel size. Here, we propose a metasurface-enhanced SL-based depth-sensing platform that scatters a high-density ~10 K dot array over a 180° FOV by manipulating light at the subwavelength scale. As a proof of concept, we place two face masks, one on the beam axis and the other 50° off-axis, within a distance of 1 m, and estimate the depth information using a stereo matching algorithm. Furthermore, we demonstrate the replication of the metasurface using the nanoparticle-embedded-resin (nano-PER) imprinting method, which enables high-throughput manufacturing of metasurfaces on arbitrary substrates. Such a full-space diffractive metasurface may afford an ultra-compact depth-perception platform for face recognition and automotive robot vision applications.


Supplementary Note 1. The effect of pixel pitch on diffraction behavior.
$P_{\mathrm{Nyquist}}$ and $P_{\mathrm{grating}}$, the pixel pitches determined from the Nyquist sampling theorem and the grating equation, respectively, give criteria for the choice of the pixel pitch of a single supercell. The light transmitted through the metasurface can be considered as a signal with a bandwidth of $2k_0$, where $k_0$ is the free-space wavenumber. If such a band-limited signal is sampled with a sampling frequency of $k_s = 2\pi/P$, replicas of the signal are added to the spectrum at intervals of $k_s$; thus, $k_s$ should be larger than $2k_0$ to perfectly reconstruct a signal of $2k_0$ bandwidth. In other words, $P$ needs to be smaller than half of the free-space wavelength, resulting in $P < P_{\mathrm{Nyquist}} = 316$ nm. On the other hand, high-order diffraction occurs even in a single supercell when the pixel pitch is larger than $P_{\mathrm{grating}}$, as derived from the transmission grating equation

$$n_t \sin\theta_m - n_i \sin\theta_i = \frac{m\lambda_0}{P},$$

where $n_t$ and $n_i$ represent the refractive indices of the transmitted and incident media, $\theta_m$ and $\theta_i$ denote the angles of the m-th order diffracted and incident beams, $\lambda_0$ is the vacuum wavelength, and $P$ is the pixel pitch. To prevent high-order diffraction in a single supercell under normal incidence, $P$ should be smaller than $\lambda_0$, resulting in the condition $P < P_{\mathrm{grating}} = 633$ nm.
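For example, at normal incidence ($\theta_i = 0$) the first diffraction order of a single pixel satisfies $\sin\theta_{\pm 1} = \pm\lambda_0/(n_t P)$ and becomes evanescent ($|\sin\theta_{\pm 1}| > 1$) when, assuming transmission into air ($n_t \approx 1$, consistent with the quoted limit),

$$P < \frac{\lambda_0}{n_t} = 633\ \mathrm{nm},$$

while the Nyquist criterion gives $P < \lambda_0/2 \approx 316$ nm.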
We simulate the effect of the pixel pitch on the intensity distribution of the diffracted beams by varying the pixel pitch from 300 nm to 700 nm in intervals of 100 nm with the optimized meta-atom. This range of pixel pitch spans across both $P_{\mathrm{Nyquist}}$ and $P_{\mathrm{grating}}$. Below 300 nm, the neighboring nanostructures are electromagnetically coupled; in other words, the evanescent electromagnetic waves of the nanostructures are not diminished before reaching the neighboring nanostructures1.
The number of pixels in a single supercell is selected to allow comparison of the intensities of the diffracted beams at the same diffraction order (Supplementary Table 1). The pixel pitch can be divided into three regimes: (1) $P < P_{\mathrm{Nyquist}}$, (2) $P_{\mathrm{Nyquist}} < P < P_{\mathrm{grating}}$, and (3) $P_{\mathrm{grating}} < P$. The intensity, angle, and number of the diffracted beams, which are pre-determined by the single supercell, appear when the supercell is repeated periodically to form a 4 × 4 supercell array (Supplementary Figure 1).
The number of supercells in the array is limited to four by the computational cost of simulating with the FDTD method. In the first regime, all diffracted beams propagate with moderately uniform intensity; however, the intensity drops at large angles, which is unavoidable because the number of sampling points decreases at a fixed sampling frequency. In the second regime, the decrease in intensity of the higher-order diffraction is steeper due to the reduced resolvable spatial frequency 1/2P. In the third regime, the decrease in intensity is much steeper, and unwanted higher-order diffraction originating from the single supercell appears at the largest diffraction orders.
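As a minimal illustration of these regimes (a sketch assuming the design wavelength of 633 nm implied by the pitch criteria above):

```python
# Classify the simulated pixel pitches into the three diffraction regimes.
# Assumption: vacuum wavelength of 633 nm, normal incidence, transmission into air.
wavelength = 633.0            # vacuum wavelength (nm)
p_nyquist = wavelength / 2    # Nyquist-limited pitch, ~316 nm
p_grating = wavelength        # grating-equation-limited pitch, 633 nm

for pitch in [300, 400, 500, 600, 700]:  # simulated pixel pitches (nm)
    if pitch < p_nyquist:
        regime = "(1) P < P_Nyquist: near-uniform orders, mild roll-off"
    elif pitch < p_grating:
        regime = "(2) P_Nyquist < P < P_grating: steeper roll-off"
    else:
        regime = "(3) P_grating < P: single-pixel higher orders appear"
    print(f"P = {pitch} nm -> {regime}")
```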

Supplementary Table 1. Simulated conditions of pixel pitch and number of pixels.
To compare the intensities of the diffracted beams at the same order m for various pixel pitches P, the number of pixels n is adjusted so that the signal frequency 1/λ is sampled with an identical sampling frequency 1/nP. 1/2P denotes the resolvable spatial frequency at the metasurface.
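A small sketch of this adjustment is shown below; the common supercell period Λ = nP (42 µm here) is a hypothetical example chosen to be divisible by all simulated pitches, not the value used in Supplementary Table 1, and the 633 nm wavelength is assumed:

```python
import numpy as np

# Keeping the supercell period Lambda = n * P fixed makes the m-th diffraction
# order appear at the same angle for every pixel pitch, so the same orders can
# be compared directly.
wavelength = 633.0     # vacuum wavelength (nm), assumed
period = 42000.0       # hypothetical common supercell period n * P (nm)

for pitch in [300, 400, 500, 600, 700]:       # pixel pitches (nm)
    n = int(round(period / pitch))            # number of pixels per supercell
    m = np.arange(1, 4)                       # first few diffraction orders
    angles = np.degrees(np.arcsin(m * wavelength / period))
    print(f"P = {pitch} nm, n = {n}: orders 1-3 at "
          + ", ".join(f"{a:.2f} deg" for a in angles))
```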

Supplementary Note 2. Comparison of commercial DOE products and metasurface-based structured light projectors.
The key performance metrics of structured light projection are the field of view (FOV), diffraction efficiency, zeroth-order efficiency, and spot intensity uniformity. The measured FOV of our 2D full-space diffractive metasurfaces reaches 180° with an overall diffraction efficiency of 60% (Supplementary Table 2). Both the diffraction efficiency and spot uniformity of our metasurface are comparable to those of state-of-the-art DOE products and previously reported metasurface-based structured light projectors, while our 2D full-space diffractive metasurfaces exhibit a record FOV in the transmissive regime. It should be noted that, for the values in ref 3, the deflection efficiency of 46.5% is normalized to the transmitted light because of the low transmission originating from the bonding process of the immersion lithography technology.

Supplementary Table 2. Previously reported metasurface-based structured light projectors and commercial DOE products.
where $I_i$ is the i-th order diffraction intensity normalized to the incident light intensity, and M is the total number of diffraction orders.
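The uniformity formula itself is not reproduced in this note; purely as an illustration of how such a metric is evaluated from the normalized order intensities $I_i$, the sketch below uses one common convention, which is an assumption and not necessarily the definition used in Supplementary Table 2:

```python
import numpy as np

def spot_uniformity(intensities):
    """One common convention: 1 - (I_max - I_min) / (I_max + I_min).

    `intensities` holds the M diffraction-order intensities I_i, each
    normalized to the incident light intensity. Returns 1 for perfectly
    equal spots and approaches 0 as the spread between orders grows.
    """
    i = np.asarray(intensities, dtype=float)
    return 1.0 - (i.max() - i.min()) / (i.max() + i.min())

# Example: a nearly uniform dot array
print(spot_uniformity([0.010, 0.011, 0.009, 0.010]))  # -> 0.9
```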

Supplementary Note 3. Point matching algorithm
By projecting 2D dot arrays, the geometric information of the object scene is obtained in the form of a 2D point cloud. Therefore, the correspondence problem in the stereo system can be solved by adopting various point set registration algorithms. In the depth estimation experiment presented in the main text, we use the coherent point drift (CPD) algorithm9.
In CPD, the alignment of the two point sets is modeled as a probability density estimation problem. One point set is represented by Gaussian mixture model (GMM) centroids, and the other point set is fitted to best match the first point set by moving coherently as a group.
Here, we denote the point sets from cameras 1 and 2 as $X = \{x_1, \ldots, x_N\}$ and $Y = \{y_1, \ldots, y_M\}$, respectively, and set $Y$ as the GMM centroids. $X$ is considered as the data points generated by the GMM. If the total numbers of points in the two point sets are given as $N$ and $M$, then the GMM probability density function for a point $x$ is expressed as

$$p(x) = \sum_{m=1}^{M+1} P(m)\, p(x \mid m),$$

where, in $D = 2$ dimensions, $p(x \mid m)$, the Gaussian distribution centered on point $y_m \in Y$ with equal isotropic covariance $\sigma^2$, is

$$p(x \mid m) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{\lVert x - y_m \rVert^2}{2\sigma^2}\right).$$

For all GMM components, the membership probability $P(m) = \frac{1}{M}$ is equal. Denoting the weight of the uniform distribution as $w$ $(0 \le w \le 1)$, the mixture model is then

$$p(x) = w\,\frac{1}{N} + (1 - w) \sum_{m=1}^{M} \frac{1}{M}\, p(x \mid m).$$

The GMM centroid locations are re-parameterized with a set of parameters $\theta$ and estimated by minimizing the negative log-likelihood function

$$E(\theta, \sigma^2) = -\sum_{n=1}^{N} \log \sum_{m=1}^{M+1} P(m)\, p(x_n \mid m),$$

with the assumption that the data are independent and identically distributed.
Here, $\theta$ can be the parameters that define the transformation, such as a rotation matrix, a translation vector, and scaling coefficients.
The correspondence probability between two points $y_m$ and $x_n$ is defined as the posterior probability of the GMM centroid given the data point,

$$P(m \mid x_n) = \frac{P(m)\, p(x_n \mid m)}{p(x_n)}.$$

Using the expectation-maximization (EM) algorithm10, the parameters are estimated by minimizing the expectation of the complete-data negative log-likelihood function,

$$Q = -\sum_{n=1}^{N} \sum_{m=1}^{M+1} P^{\mathrm{old}}(m \mid x_n) \log\left(P^{\mathrm{new}}(m)\, p^{\mathrm{new}}(x_n \mid m)\right). \qquad (7)$$

The EM algorithm is iteratively conducted by alternating between the two steps (expectation and maximization) until it converges.
If we ignore the constants that are independent of $\theta$ and $\sigma^2$, the above equation can be written as

$$Q(\theta, \sigma^2) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{m=1}^{M} P^{\mathrm{old}}(m \mid x_n)\, \lVert x_n - T(y_m, \theta) \rVert^2 + \frac{N_{\mathbf{P}} D}{2} \log \sigma^2,$$

where $N_{\mathbf{P}} = \sum_{n=1}^{N}\sum_{m=1}^{M} P^{\mathrm{old}}(m \mid x_n) \le N$ (with $N_{\mathbf{P}} = N$ only if $w = 0$) and $P^{\mathrm{old}}$ denotes the posterior probabilities of the GMM components calculated using the previous parameter values:

$$P^{\mathrm{old}}(m \mid x_n) = \frac{\exp\left(-\frac{1}{2\sigma^{\mathrm{old}\,2}}\,\lVert x_n - T(y_m, \theta^{\mathrm{old}}) \rVert^2\right)}{\sum_{k=1}^{M} \exp\left(-\frac{1}{2\sigma^{\mathrm{old}\,2}}\,\lVert x_n - T(y_k, \theta^{\mathrm{old}}) \rVert^2\right) + c},$$

where $c = (2\pi\sigma^2)^{D/2}\,\frac{w}{1-w}\,\frac{M}{N}$. We define $T(y_m; B, t) = B y_m + t$ as the affine transformation between the two sets, where $B$ is a $D \times D$ affine transformation matrix and $t$ is a translation vector.
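For concreteness, the following is a minimal, self-contained sketch of affine CPD registration following the equations above; it is not the implementation used in the experiment, and hyperparameters such as the outlier weight w, the iteration count, and the synthetic test data are illustrative only.

```python
import numpy as np

def affine_cpd(X, Y, w=0.1, max_iter=100, tol=1e-8):
    """Register point set Y (GMM centroids) onto X (data points) with an
    affine map T(y; B, t) = B @ y + t, following the CPD formulation above.

    X: (N, D) data points, e.g. dots detected by camera 1
    Y: (M, D) GMM centroids, e.g. dots detected by camera 2
    w: weight of the uniform (outlier) component, 0 <= w < 1
    Returns the transformed points T(Y), the matrix B, and the translation t.
    """
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    (N, D), M = X.shape, Y.shape[0]
    B, t = np.eye(D), np.zeros(D)
    # Standard CPD initialization of the isotropic variance sigma^2
    sigma2 = np.sum((X[None, :, :] - Y[:, None, :]) ** 2) / (D * M * N)

    for _ in range(max_iter):
        TY = Y @ B.T + t                                    # current T(y_m)

        # E-step: posterior probabilities P_old(m | x_n), shape (M, N)
        d2 = np.sum((X[None, :, :] - TY[:, None, :]) ** 2, axis=2)
        P = np.exp(-d2 / (2 * sigma2))
        c = (2 * np.pi * sigma2) ** (D / 2) * w / (1 - w) * M / N
        P /= P.sum(axis=0, keepdims=True) + c

        # M-step: minimize Q(theta, sigma^2) over B, t, and sigma^2
        P1, Pt1 = P.sum(axis=1), P.sum(axis=0)              # P @ 1, P^T @ 1
        Np = P1.sum()                                       # N_P
        mu_x, mu_y = (X.T @ Pt1) / Np, (Y.T @ P1) / Np
        Xh, Yh = X - mu_x, Y - mu_y                         # centered point sets
        A = Xh.T @ P.T @ Yh
        B = A @ np.linalg.inv(Yh.T @ (P1[:, None] * Yh))
        t = mu_x - B @ mu_y
        sigma2_new = (np.trace(Xh.T @ (Pt1[:, None] * Xh))
                      - np.trace(A @ B.T)) / (Np * D)
        if sigma2_new <= tol or abs(sigma2 - sigma2_new) < tol:
            sigma2 = max(sigma2_new, tol)
            break
        sigma2 = sigma2_new

    return Y @ B.T + t, B, t

# Example: recover a known affine distortion between two synthetic dot patterns
rng = np.random.default_rng(0)
Y = rng.uniform(-1.0, 1.0, size=(200, 2))                   # camera-2 point cloud
B_true, t_true = np.array([[1.1, 0.2], [-0.1, 0.9]]), np.array([0.3, -0.2])
X = Y @ B_true.T + t_true                                   # camera-1 point cloud
TY, B_est, t_est = affine_cpd(X, Y)
print("mean residual:", np.mean(np.linalg.norm(TY - X, axis=1)))
```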